Conceptualizing subtitles
To understand things, our mind needs to build an image of how they actually are. This image can be created by reading, by having things explained, or by looking at some pictures. And sometimes even well-known concepts deserve to be explained from scratch, so that new points of view can be unlocked. We are going all the way in and using all of these ways 😄 🚀.
Subtitles can be seen as a sequence of phrases and words distributed non-linearly over time. These words are usually associated with the audio track, although this may depend on the media content. In theory, we can draw a distinction:
- Subtitles are for people who do not understand the language and therefore need some support to follow the content;
- Captions are for people with hearing issues who cannot hear the audio correctly and might therefore need additional details;
In reality, technically speaking, both will be rendered in the same way: the only difference is the amount of information that captions provide. Captions might include more details, like sound descriptions. Hence, the difference lies in the "data" provided to be shown. The typical situation can be idealized with the scheme below:
Once the data is identified and no longer just random binary data, it can be seen as a "track": a sequence of data with a specific meaning, ordered over time.
As we were saying, Tracks are therefore composed of a non-linearly-distributed set of cues.
Each cue has its own properties, which may or may not differ from the properties of another cue, but the most important ones are the starting time, ending time and the content itself.
It comes naturally to think, then, that some cues might overlap and be shown at the same time, or with some time offset from the previous one.
We might draw them out. What comes out? A timeline like the ones in video editing software.
So, the principle is that at any given point in time we must show all the cues belonging to that moment. Hence, we need a suitable data structure able to contain the cues in the correct way and allow us to retrieve only the ones we are looking for at a specific time.
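To make this concrete, here is a minimal sketch of what a cue could look like. The names `endTime` and `content` are illustrative assumptions that mirror the properties described above; the library's actual type may differ.

```typescript
// Hypothetical cue shape, for illustration only.
interface CueNode {
  startTime: number; // when the cue should appear, in seconds
  endTime: number;   // when the cue should disappear, in seconds
  content: string;   // the text to render
}

// Two cues that overlap between 16.000s and 16.700s:
const cues: CueNode[] = [
  { startTime: 15.3, endTime: 16.7, content: "Hello there." },
  { startTime: 16.0, endTime: 18.1, content: "General Kenobi!" },
];
```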
After some research, the most suitable structure to adopt turned out to be the Interval Binary Tree because, unlike normal binary trees, each node (or leaf) owns its own starting point and its own ending point. Also, each node holds a maximum value, the maximum ending point found in its subtree, so it can easily tell whether its subtree should be queried when looking for possible cues.
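As a sketch, a node of such a tree could look like this (the field names `low`, `high` and `max` follow the description in this section; the shape itself is an assumption, not the actual implementation):

```typescript
// Hypothetical interval tree node, for illustration only.
interface IntervalNode {
  low: number;          // the cue's startTime
  high: number;         // the cue's endTime
  max: number;          // the greatest endTime found in this node's subtree
  cue: CueNode;         // the cue itself (see the CueNode sketch above)
  left?: IntervalNode;  // cues starting earlier
  right?: IntervalNode; // cues starting at the same time or later
}
```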
In our case, the maximum value is the ending time of the cue. In the IBT, we add a left leaf if the new node's `low` (in our case, the `startTime`) is lower than the current node's `low`; otherwise, we add it as a right leaf.
Each node owns a "max value" which, in the current implementation, is the maximum between its own `high` and the `high` of the nodes inserted beneath it. Hence, if a new, longer node is inserted, all its parent nodes get updated with the new maximum.
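Putting that rule into code, an insertion could be sketched as follows. This is an assumption built on the `IntervalNode` shape above, not the library's actual code.

```typescript
// Insert a cue: go left when the new low is lower, right otherwise,
// and keep every visited node's max up to date along the way.
function insertCue(root: IntervalNode | undefined, cue: CueNode): IntervalNode {
  const node: IntervalNode = {
    low: cue.startTime,
    high: cue.endTime,
    max: cue.endTime,
    cue,
  };

  if (!root) {
    return node;
  }

  // A longer cue updates the maximum of every ancestor it passes through.
  root.max = Math.max(root.max, node.high);

  if (node.low < root.low) {
    root.left = insertCue(root.left, cue);
  } else {
    root.right = insertCue(root.right, cue);
  }

  return root;
}

// Building the example tree:
// let root: IntervalNode | undefined;
// for (const cue of cues) root = insertCue(root, cue);
```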
This structure protects us from unordered cues, just like in the example above, where the yellow cue is the first one to be inserted (and is, therefore, the uppermost root).
It follows that if all the cues are ordered and perfectly sequential, a right-only IBT will be created.
Querying the IBT in the example for the range `[ 15.300, 18.100 ]` (15 s 300 ms and 18 s 100 ms), the query will proceed only to the right and retrieve the root and the first child on the right.
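A range query over the sketch above could look like the following: thanks to the `max` value, whole subtrees that cannot contain overlapping cues are skipped. Again, this is only an illustrative sketch of the idea, not the actual implementation.

```typescript
// Collect every cue whose interval overlaps [from, to].
function queryRange(
  node: IntervalNode | undefined,
  from: number,
  to: number,
  out: CueNode[] = [],
): CueNode[] {
  // Nothing in this subtree ends after `from`: prune it entirely.
  if (!node || node.max < from) {
    return out;
  }

  queryRange(node.left, from, to, out);

  // A cue overlaps the range when it starts before the range ends
  // and ends after the range starts.
  if (node.low <= to && node.high >= from) {
    out.push(node.cue);
  }

  // If this node already starts after `to`, every right descendant does too.
  if (node.low <= to) {
    queryRange(node.right, from, to, out);
  }

  return out;
}

// For the example above:
// queryRange(root, 15.3, 18.1); // the root's cue and its first right child
```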