Sorry in advance for this being super long for a relatively small problem.
I ran across another instance of a movie getting matched to the wrong RT ratings and have been thinking about how to improve the RT search logic and make it generally more robust. I'd like to have a discussion about some of these ideas and whether or not they're worthwhile, and then I'll be happy to open a PR implementing whatever is decided on.
For starters, the specific mismatch that prompted this involves the movie "The OctoGames" (/movie/1028703 in Jellyseerr) which ends up showing the RT ratings for the unrelated movie "The Stranger." The reason for this failure has two parts:
1. TMDB lists the title for this movie as "The OctoGames" while RT has the title as just "Octogames" (note both the lack of "The" and the lowercase "g") with no alternative titles. Searching the RT Algolia API for "The OctoGames" doesn't return the correct movie anywhere in the list of results. The movie is on RT, but you need to search without "The" to find it.
2. The Jellyseerr RT search result matching logic does a desperate pass through the results looking for any result with a matching release year, and will use that result even if the titles don't match at all. In this case, the RT results are just a pile of random popular movies that have "The" in the title, and one of them happens to have the same release year as Octogames, so it gets selected.
There are, roughly, two separate issues to address here. One is normalizing search queries before sending them to the RT algolia API, which may be as simple as stripping "the" from search queries, since the API seems to be sensitive to it. The other issue is Jellyseerr's result selection logic lacking robustness when presented with low quality search results, which is my main focus here.
For context, the current result selection logic for movies performs 4 passes over the search results. Each pass tries to grab the first movie that satisfies its condition, and the logic moves on to the next pass if no result matches. The conditions of the 4 passes are:
1. Exact title and exact year match
2. Partial title (via String.includes) and exact year match
3. Exact year match only (!)
4. Exact title match only
The third pass is the most problematic one, and it is what caused the mismatch in this specific case.
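To make the failure mode concrete, here is a minimal sketch of the four passes as described above. It is not the actual Jellyseerr source: the `RtResult` shape, the `pickByPasses` name, the direction of the `includes` check, and the sample data (including the year) are all assumptions for illustration.

```typescript
// Hypothetical result shape; the real RT search payload has more fields.
interface RtResult {
  title: string;
  year: number;
}

// Sketch of the four-pass selection described above (assumed, not copied
// from Jellyseerr). Each pass returns the first result that satisfies it.
function pickByPasses(
  results: RtResult[],
  title: string,
  year: number
): RtResult | undefined {
  return (
    // Pass 1: exact title and exact year
    results.find((r) => r.title === title && r.year === year) ??
    // Pass 2: partial title (String.includes) and exact year
    results.find((r) => r.title.includes(title) && r.year === year) ??
    // Pass 3: exact year only, regardless of title
    results.find((r) => r.year === year) ??
    // Pass 4: exact title only
    results.find((r) => r.title === title)
  );
}

// Illustrative data only: unrelated results, one sharing the wanted year.
const sampleResults: RtResult[] = [
  { title: "The Godfather", year: 1972 },
  { title: "The Stranger", year: 2022 },
];

// No title matches, so pass 3 selects the first result with the same year.
const picked = pickByPasses(sampleResults, "The OctoGames", 2022);
console.log(picked?.title);
```

Because pass 3 ignores the title entirely, any unrelated result that shares the release year wins as soon as passes 1 and 2 come up empty.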
Here is a rough (and by no means complete) collection of potential data mismatches that should be accounted for:
Capitalization differences in titles
Minor article and punctuation differences in titles, such as missing "The" or "A", or ":" vs "-"
Alternate titles being treated as primary titles
Off-by-one years, for smaller movies with complicated release stories that happened near the end/beginning of a year
An interesting observation here is that both titles and years can have three potential match states, which I'll refer to as "exact," "partial," and "none." This makes me think that the best approach would be to scan the list of search results once, assign a "score" to each result by multiplying together subscores for how closely the titles and years match, and then pick the result with the highest score above a minimum threshold. An exact match would contribute a subscore of 1, a non-match would contribute a subscore of 0 (forcing the final score to 0 as well), and a partial match would contribute a subscore between 0 and 1, potentially based on how close the match is.
Here is my proposed scoring logic so far:
Year:
  exact match: 1
  off by one: 0.5 (?)
  anything else: 0
Title:
  exact match between normalized (?) titles: 1
  partial match via any sensible string similarity algo (?): the score of that algo
  sufficiently low (?) similarity: 0
Under this scheme, exact title and year matches would have a score of 1 and automatically win. Results that have very different titles or years that are more than 1 off would have a score of 0 and automatically lose. Anything in between will compete for the highest title*year score.
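The single-pass scoring idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Candidate` shape, the function names, and the `minScore` default of 0.2 are hypothetical, the title subscore uses the simple string-contains option with plain lowercasing standing in for normalization, and the 0.5 values are the placeholder numbers from the proposal.

```typescript
// Hypothetical candidate shape for illustration.
interface Candidate {
  title: string;
  year: number;
}

// Year subscore: exact -> 1, off by one -> 0.5 (placeholder value), else 0.
function yearScore(a: number, b: number): number {
  const diff = Math.abs(a - b);
  if (diff === 0) return 1;
  if (diff === 1) return 0.5;
  return 0;
}

// Title subscore using the string-contains option: exact -> 1,
// substring -> 0.5, otherwise 0. Lowercasing stands in for normalization.
function titleScore(a: string, b: string): number {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  if (x === y) return 1;
  if (x.length > 0 && y.length > 0 && (x.includes(y) || y.includes(x))) {
    return 0.5;
  }
  return 0;
}

// Scan the results once and keep the highest title * year score,
// rejecting everything below the (assumed) minimum threshold.
function pickBest(
  candidates: Candidate[],
  title: string,
  year: number,
  minScore = 0.2
): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = 0;
  for (const c of candidates) {
    const score = titleScore(c.title, title) * yearScore(c.year, year);
    if (score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return bestScore >= minScore ? best : undefined;
}

// Illustrative data only.
const best = pickBest(
  [
    { title: "The Stranger", year: 2022 },
    { title: "Octogames", year: 2022 },
  ],
  "The OctoGames",
  2022
);
console.log(best?.title);
```

With multiplication, a zero in either subscore removes the result from contention, so a matching year alone can no longer select an unrelated title.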
I put (?)s in places of the scoring logic above where I would particularly like feedback.
The 0.5 score for off-by-one years is arbitrary. I don't think the exact value matters too much, but I could see arguments existing for lowering (0.3?) or raising (0.8?) it.
"Normalizing" is pretty open ended. I envision it as lowercasing, stripping non-alphanumeric characters like ":", and removing minor articles like "the," but I'm not married to any of those details.
A string similarity algo could be as simple as the one already being used (i.e., string-contains) or as complicated as something like Jaro distance. I don't think this needs to be particularly robust, and we'd probably be taking on an additional library dependency if we decide to go the Jaro route, so I think it's reasonable to stick with a simpler check if desired.
Depending on the string similarity algorithm, a cutoff score should be determined where values below that should be forced to 0, so we don't accidentally end up with cases where very different titles and similar years end up with a very low but non-zero total score and still win because everything else is 0. If we go with jaro, that might be "scores under 0.3 become 0." If we go with string-contains, it might just be "substring match -> 0.5, no match -> 0"
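As one possible reading of the normalization and cutoff points above, here is a small sketch. The article list, the `normalizeTitle` and `titleSubscore` names, and the 0.3 cutoff default are assumptions, and the similarity function is pluggable so that a Jaro-style algorithm could be swapped in without changing the rest.

```typescript
// Assumed set of minor articles to drop; the exact list is open for discussion.
const ARTICLES = new Set(["the", "a", "an"]);

// Lowercase, replace non-alphanumeric characters with spaces, drop articles.
// Note: this ASCII-only pattern also strips accented and non-Latin letters.
function normalizeTitle(title: string): string {
  const words = title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0 && !ARTICLES.has(w));
  // If everything was stripped (e.g. a title that is just "A"),
  // fall back to the lowercased original.
  return words.length > 0 ? words.join(" ") : title.toLowerCase().trim();
}

type Similarity = (a: string, b: string) => number;

// String-contains option: substring match -> 0.5, no match -> 0.
const containsSimilarity: Similarity = (a, b) =>
  a.length > 0 && b.length > 0 && (a.includes(b) || b.includes(a)) ? 0.5 : 0;

// Exact normalized match -> 1; otherwise the similarity score,
// forced to 0 when it falls below the cutoff.
function titleSubscore(
  a: string,
  b: string,
  similarity: Similarity = containsSimilarity,
  cutoff = 0.3
): number {
  const x = normalizeTitle(a);
  const y = normalizeTitle(b);
  if (x === y) return 1;
  const s = similarity(x, y);
  return s < cutoff ? 0 : s;
}

console.log(normalizeTitle("The OctoGames"));
console.log(titleSubscore("The OctoGames", "Octogames"));
console.log(titleSubscore("The OctoGames", "The Stranger"));
```

Removing articles before comparing means that two unrelated titles no longer look similar just because both start with "The".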
Also note that the RT algolia API returns alternate titles, but we don't currently use them. I intend to check all of them in the updated logic, and not just the primary title.
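Checking alternate titles could be as simple as taking the best title subscore across every title a result carries. In this sketch the `alternateTitles` field name is hypothetical (I have not confirmed what the RT response calls it), and the default scorer is a stand-in for whichever title comparison is eventually chosen.

```typescript
// Hypothetical hit shape; `alternateTitles` is an assumed field name.
interface RtHit {
  title: string;
  alternateTitles?: string[];
}

type TitleScorer = (candidate: string, wanted: string) => number;

// Stand-in scorer: case-insensitive exact match only.
const exactIgnoringCase: TitleScorer = (candidate, wanted) =>
  candidate.toLowerCase() === wanted.toLowerCase() ? 1 : 0;

// Score the primary title and every alternate title, and keep the best one.
function bestTitleScore(
  hit: RtHit,
  wanted: string,
  score: TitleScorer = exactIgnoringCase
): number {
  const titles = [hit.title, ...(hit.alternateTitles ?? [])];
  return Math.max(...titles.map((t) => score(t, wanted)));
}

// Illustrative data only.
const hit: RtHit = { title: "Primary Title", alternateTitles: ["Other Name"] };
console.log(bestTitleScore(hit, "other name"));
```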
If this approach sounds good then I'll get started on a PR. Please let me know about any thoughts you have or tweaks that should be made.
Version
latest develop
Steps to Reproduce
Go to /movie/1028703
Click RT ratings
Observe being linked to the wrong RT movie page
Screenshots
Logs
No response
Platform
desktop
Database
SQLite (default)
Device
Any
Operating System
Any
Browser
Any
Additional Context
No response
Code of Conduct
I agree to follow Jellyseerr's Code of Conduct